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Abstract 

The identification of low-energy conformers for a given molecule is a fundamental problem in 
computational chemistry and cheminformatics. We assess here a conformer search that employs a 
genetic algorithm for sampling the low-energy segment of the conformation space of molecules. 
The algorithm is designed to work with first-principles methods, facilitated by the incorporation of 
local optimization and blacklisting conformers to prevent repeated evaluations of very similar so¬ 
lutions. The aim of the search is not only to find the global minimum, but to predict all conformers 
within an energy window above the global minimum. The performance of the search strategy is: 
(i) evaluated for a reference data set extracted from a database with amino acid dipeptide conform¬ 
ers obtained by an extensive combined force field and first-principles search and (ii) compared to 
the performance of a systematic search and a random conformer generator for the example of a 
drug-like ligand with 43 atoms, 8 rotatable bonds and 1 cis/trans bond. 
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Introduction 


One of the fundamental problems in cheminformatics and computational chemistry is the identi¬ 
fication of three-dimensional (3D) conformers that are energetically favourable and likely to be 
encountered in experiment at given external conditions.® Conventionally, these conformers are 
often characterized by specific, fixed sets of nuclear coordinates or ensembles thereof, and their 
potential energy is given by the electronic degrees of freedom in a Born-Oppenheimer picture 
of the chemical bond. A variety of conformations can be adopted by flexible organic molecules 
as the multi-dimensional potential-energy surface (PES) usually contains multiple local minima, 
with a global minimum among them. Only when the relevant conformers are known, one can 
predict and evaluate chemical and physical properties of the molecules (e.g. reactivity, catalytic 
activity, or optical properties). In many practical applications, the PES minima are taken as start¬ 
ing points to explore the free-energy surface (EES). Generating conformers is an integral part of 
methods such as protein-ligand docking®® or 3D pharmacophore modeling.® The propensity to 
adopt a certain conformation strongly depends on the environment and possible interactions with 
other compounds. It has been shown that the bioactive conformation of drug-like molecules can 
be higher in energy than the respective global minimum® and that different 3D conformations may 
be induced by specific interactions with other molecules.® Thus, it is crucial to focus not just on 
a single, global minimum of the PES, but instead to provide a good coverage of the accessible 
conformational space of a molecule yielding diverse low-energy conformers. 

The exploration of a high-dimensional PES is challenging. A selection of popular sampling 
approaches utilized in conformer generation is summarized in Table We focus specifically on 
genetic algorithms (GAs)l^^^® that belong to the family of evolutionary algorithms (EAs) that are 
frequently used for global structure optimization of chemical compoundsGAs for chem¬ 
ical structure searches implement a ’survival of the fittest’ concept and adopt evolutionary princi¬ 
ples starting from a population of, most commonly, random solutions. GAs use the accumulated 
information to explore the most promising regions of the conformational space. With this, the 
number of unhelpful evaluations of physically implausible high-energy solutions can be reduced. 
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Table 1: Popular sampling approaches. Names of freely available programs are highlighted in 
boldface. 


Method 

Description 

Implemented, e.g., in 

grid-based 

based on grids of selected Cartesian or internal 
coordinates (e.g., grids of different torsional an¬ 
gle values of a molecule) 

CAESAR, 0 Open Babel, 

Confab,^ MacroModekC^ 

MOEtS! 

rule/knowledge - 
based 

use known (e.g., from experiments) structural 
preferences of compounds 

ALFA,E] confect,-!^ CO- 

RINA and ROTATE,!™ cOS- 
mos,!™omegaI2S1 

population-based 

metaheuristic 

improve candidate solutions in a guided search 

Balloon,!2Il Cyndi!22 

distance geometry 

based on a matrix with permitted distances be¬ 
tween pairs of atoms 

RDKiPl 

basin-hopping^ / 

minima hoppingl^ 

based on moves across the PES combined with 

local relaxation 

ASE,!^ GMIN,!^ TINKER 

SCAn!^ 


Examples of GA-based structure prediction applications include: (i) conformational searches for 
molecules like of unbranched alkanes^ or polypeptide folding;!^ (ii) molecular design(iii) 
protein-ligand docking;!^ cluster optimization;®S2l (y) predictions of crystal structures;^^^^(vi) 
structure and phase diagram predictions Further, Neiss and Schoosl^ proposed a GA including 
experimental information into the global search process by combining the energy with the experi¬ 
mental data in the objective function. Since GAs typically rely on internal, algorithmic parameters 
that control the efficiency of a search, a meta-GA for optimization of a GA search for conformer 
searches was proposed by Brain and Addicoat.l^ 

Aside from the search algorithm itself, the choice of the mathematical model for the PES is crit¬ 
ical to ensure results that reliably reflect the experimental reality. Among the available atomistic 
simulation approaches, "molecular mechanics" models, i.e., so-called force fields are especially 
fast from a computational point of view and therefore often employed. However, the resulting 
predictions depend on the initial parametrization of a particular force field and can lead to con¬ 
siderable rearrangements of the true PES for molecules that were not included in the parameteri¬ 
zation procedureOn the other end of the spectrum of approaches, the PES can be faithfully 
represented based on the "first principles" of quantum mechanics. Indeed, benchmark quality ap¬ 
proaches such as coupled cluster theory at the level of singles, doubles and perturbative triples 
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(CCSD(T)) are almost completely trustworthy for closed-shell molecules, but still prohibitively 
expensive towards larger systems and/or large-scale screening of energies of many conformers. 
Density-functional theory (DFT) approximations are an attractive alternative to balance accuracy 
and computational cost. The choice of the approximation is critical when using first principles 
methods like DFT. It has been shown that it is necessary to incorporate dispersion effects for 
(bio)organic molecules and their complexes The challenge of including long-range inter¬ 

actions has been met for example by the dispersion correction schemes described by Grimme^^^*^ 
or by Scheffler and Tkatchenko,l^2^^ but validating the DFT approximation employed is critical. 
In fact, subtle energy balances of competing conformers can require relatively high-level DFT 
approximations for reliable predictions. I— 

The aim of our work is to develop and test an approach to sample the PES of small to medium 
sized (bio)organic molecules without relying on empirical force fields, utilizing instead electronic- 
structure methods for the entire search. With the molecular structure problem in mind, we define 
following requirements for the search strategy and implementation: 

• Global search based on user-curated torsional degrees of freedom (bond rotations). 

• Local optimization based on full relaxation of Cartesian coordinates and avoidance of re¬ 
computing too similar structures to ensure both efficient sampling and economic use of a 
computationally demanding energy function. 

• Design of the program in a way to use an external and easily exchangeable electronic struc¬ 
ture code (in our case FHI-aimsI ^^ I ^^ ^. 

• Simple input of molecules (composition and configuration) by means of SMILES codes 

• A robust and simple metaheuristic that ideally identifies the complete ensemble of low- 
energy conformers. 

• Freely available and with a flexible open-source license model. 

• Support for parallel architectures. 
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Based on these requirements, we present in this work a eonformational seareh strategy based on 
a genetie algorithm. We provide a detailed deseription of our approaeh and a software imple¬ 
mentation Fafoom (Flexible algorithm for optimization of moleeules) that is available under an 
open-souree lieense (GNU Lesser General Publie Lieensd^ for use by others. For simplieity, 
we abbreviate ’potential energy’ with ’energy’ and ’minima of the potential-energy surfaee’ with 
’energy minima’. 

Methods 

In the following, we first motivate and explain assumptions that we met for handling 3D strue- 
tures of moleeules. Further, we outline the algorithm’s implementation and deseribe its teehnieal 
details. Finally we introduee a data set that we use as a referenee for evaluating the performanee 
of our implementation. Our work foeuses on both the ability to reliably prediet the global mini¬ 
mum and to provide a good eonformational eoverage with a eomputationally feasible approaeh. To 
aehieve that, we formulate some speeifie algorithmie ehoiees at the outset: (i) only sterieally feasi¬ 
ble eonformations are aeeepted for loeal optimization; (ii) a geometry optimization to the next loeal 
minimum is performed for every generated eonformation; (iii) an already evaluated eonformation 
will not be evaluated again. 

Choice of coordinates 

In eomputational ehemistry, at least two ways of representing a moleeule’s 3D strueture are eom- 
monly used, either Cartesian or internal eoordinates. The simplest internal eoordinates are based on 
the ’Z-matrix eoordinates’, whieh inelude bond lengths, bond angles and dihedral angles (torsions) 
and ean also be referred to as ’primitive internal eoordinates’. These eoordinates refleet the aetual 
eonneetivity of the atoms and are well suited for representing eurvilinear motions sueh as rotations 
around single bonds.^^Bond lengths and bond angles posses usually only one rigid minimum, i.e. 
the energy inereases rapidly if these parameters deviate from equilibrium. In eontrast, torsions ean 
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change in value by an appreeiable amount without a dramatie ehange in energy. Similar to the 
work of Damsbo et we use Cartesian eoordinates for the loeal geometry optimizations while 
internal eoordinates, in this work only torsional degrees of freedom (TDOFs), i.e. freely rotatable 
bonds and, if present, cis/trans bonds, are used for the global seareh. We eonsider only single, 
non-ring bonds between non-terminal atoms to be rotatable bonds after exeluding bonds that are 
attaehed to methyl groups that earry three identieal substituents. Further we allow for treating 
seleeted bonds in a cis/trans mode, i.e. allowing only for two different relative positions of the 
substituents. In eases in whieh the substituents are oriented in the same direetion we refer to it as 
to cis, whereas, when the substituents are oriented in opposite direetions, we refer to it as to trans. 

Handling of molecular structures 

Figure shows different ehemieal representations of a moleeule, here for the example of 3,4- 
dimethylhex-3-ene. Figure and B depiet the standard 3D and 2D representation of the eom- 
pound together with marked cis/trans and rotatable bonds. A SMILES (simplified moleeular-input 
line-entry system) string is shown in Figure [^. A SMILES representation^^ of a ehemieal eom- 
pound eneodes the eomposition, eonneetivity, the bond order (single, double, triple), as well as 
stereoehemieal information in a one-line notation. Einally, a veetor representation (Eigure[^) ean 
be ereated if the loeations of cis/trans and rotatable bonds are known. The veetor will store the eor- 
responding torsion angle values. Our implementation takes as input a SMILES representation of a 
moleeule, while veetors of angles are used to internally eneode different struetures in the genetie 
algorithm below. 

Frequently used terms 

Several terms need to be defined prior to deseribing the strueture of the algorithm. In the following, 
the parameters of the seareh are highlighted in boldfaee. These parameters are input parameters to 
the algorithm and need to be defined in the input file. 

A sensible geometry meets two eonstraints. Eirst, the atoms are kept apart, i.e. none of the 
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Figure 1: Different chemical representations of 3,4-dimethylhex-3-ene: (A) 3D structure with 
rotatable bonds marked in red and the cis/trans bond marked with double arrows. (B) 2D structure. 
(C) SMILES string. (D) vector representation of the molecule. The first value encodes the torsion 
angle value for the cis/trans bond and the two remaining position store the torsion angle values of 
the rotatable bonds. 

distances between non-bonded atoms can be shorter than a defined threshold (distance_cutoff_l, 
default=1.3 A). Secondly, it is fully connected, i.e. none of the distances between bonded atoms 
can be longer than a defined threshold (distance_cutoff_2, default=2.15 A). The attribute sensible 
can be used further to describe any operation that outputs sensible geometries. 

The blacklist stores all structures that: (i) were starting structures for the local optimization and 
(ii) resulted from local optimization, as they may have changed significantly during the optimiza¬ 
tion. In case of achiral molecules (chiral, default=False) also the corresponding mirror images are 
created and stored. 

A structure is unique if none of the root-mean-square deviation (RMSD) values calculated for 
the structure paired with any of the structures stored in the blacklist is lower than a defined thresh¬ 
old (rmsd_cutoff_uniq, default=0.2 A). We consider only non-hydrogen atoms for the calculation 
of the RMSD. 
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Basic outline of the search algorithm 

We implemented the genetic algorithm (GA) using the Python language (version 2.7) and employ 

the RDKit library An overview is presented in Algorithm 

# initialization 

while i < pop size’. 

X = random_sensible_geometry 
blacklist. append(x) 

X = DFT_relaxation(x) 
blacklist. append(x) 
population. append(x) 
i+=l 

# iteration 

while j < iterations’. 

population.sort(index=energy) 

(parent 1, parent2) = population.select_candidates(2) 

(child 1, child2) = sensible_crossover(parentl, parent2) 

(child 1, child2) = mutation(childl, child2) 
repeat 

(child 1, child2) = mutation(childl, child2) 
until childl and child! are sensible and are not in the blacklist 

blacklist.append(childl, child2) 

(childl, child2) = DFT_relaxation(childl, child2) 
blacklist.append(childl, child2) 
population.append(child 1, child2) 
population.sort(index=energy) 
population.delete_high_energy_candidates(2) 
if convergence criteria met: 

break 

else: 

j+=l 

Algorithm 1: Genetic algorithm for sampling the conformational space of molecules. 

Initialization of the population 

First, a random 3D structure is generated with RDKit directly from the SMILES code. This struc¬ 
ture serves as a template for the upcoming geometries. Next, two lists of random values are gener¬ 
ated: one for the rotatable bonds and one for the cis/trans values. If the resulting 3D geometry is 
sensible, the structure is then subjected to local optimization. To generate an initial population of 
size N (popsize), N sensible geometries with randomly assigned values for torsion angles need to 
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be built and locally optimized. The optimized geometries constitute the initial population. Due to 
the fact that the geometries are created one after another, all randomly built structures can but do 
not have to be made unique in order to increase the diversity of the initial population. 

An iteration of the GA 

Our GA follows the established generation-based approach, i.e. the population evolves over subse¬ 
quent generations. After completion of the initialization, the first iteration can be performed. For 
this purpose, the population is sorted and ranked based on the total energy values Et of the different 
conformers / = 1,..., A. For each individual the fitness Ft is being calculated according to: 

Emax Ei 

Pi = p ( 1 ) 

^max ^min 

Emax is the highest energy and Emin is the lowest energy among the energies of the conformers 
belonging to the current population. As a result, F = l for the ’best’ conformer and F = 0 for the 
’worst’ conformer. In the case of a population with low variance in energy values {Emax —Emin < 
energy_var, default=0.001 eV ), all individuals are assigned a fitness of 1.0. 

Selection. Two individuals need to be selected prior to the genetic operations. We implemented 
three mechanisms for the selection. 

(i) In the (energy-based) roulette wheel^ the probability pi for selection of a conformer i is 
given by: 


Pi = 



( 2 ) 


With this, the probabilities of the conformers are mapped to segments of a line of length one. 
Next, two random numbers between zero and one are generated and the conformers whose seg¬ 
ments contain these random numbers are selected. In the case when the sum of the fitness values 
is lower than a defined threshold near one (fitness_sum_limit, default=1.2) the best and a random 
individual are selected. 
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(ii) The reverse roulette wheel proceeds similarly to the roulette wheel mechanism with the 
difference that the fitness values are swapped, i.e. new fitness F* is assigned to each conformer: 



(3) 


Analogously, the probability pi for selection of a conformer i is given by: 




(4) 


(iii) In the random selection mechanism all individuals have the same chance to be selected. 

In all selection mechanisms the selected individuals must be different from each other so that 
the crossing-over has a chance to produce unique conformers. 

Crossing-over. Crossing-over is considered to be the main feature distinguishing evolutionary 
algorithms from Monte Carlo techniques where only a single solution can evolve. Crossing-over 
allows the algorithm to take big steps in exploration of the search space.^ In our algorithm, a 
crossing-over step is performed if a generated random number (between zero and one) is lower than 
a defined threshold (prob_for_crossing, default=0.95). Between the selected individuals, parts of 
the representing vectors are then exchanged. To that end, the vectors characterizing the structure 
of both individuals are "cut" at the same single position (determined at random). The first part of 
the first individual is then combined with the second part of the second, and vice versa (a scheme 
explaining the crossing-over procedure is provided in Figure SI). Crossing-over is successful 
only when the resulting vectors can be used for generating sensible geometries. Otherwise the 
crossing-over is repeated until sensible geometries are generated or a maximum number of attempts 
(cross_trial, default=20) is exceeded. In the latter case, exact copies of the selected conformers 
are used for the following step. 

Mutations are performed independently for the values of cis/trans bonds and of the rotatable 
bonds and if randomly generated numbers exceed corresponding thresholds (prob_for_mut_cistrans, 
default=0.5; prob_for_mut_rot, default=0.5). For each, the number of mutation events is deter- 
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mined by a randomly picked integer number not higher than the user-defined maximal number of 
allowed mutations (max_mutations_cistrans and max_mutations_torsions). For each mutation, 
a random position of the vector is determined and the mutation is chosen to affect the value of that 
variable. In case of cis/trans bonds, the selected value is changed to 0° if it was above 90° or below 
—90°, else it is changed to 180°. A selected rotatable bond is changed to a random integer between 
— 179° and 180°. A mutation step is only successful if the geometry built after the mutation of the 
vector is sensible and unique. Otherwise the entire set of mutations in a mutation step is repeated 
until a sensible and unique structure is generated or a maximum number of attempts (mut_trial, 
default=100) is exceeded. In this case, the algorithm terminates. The mutation is performed for 
both vectors generated via crossing-over. 

Local optimization and update. As the computational cost of the local optimization is signifi¬ 
cantly higher than all of the other operations,only unique and sensible structures are subject 
to local optimization. The structures are transferred to an external program for local geometry 
optimization (here: FHI-aimsI ^^ l ^^ i see section DFT calculations). The application of local op¬ 
timization was shown to facilitate the search for minima by reducing the space the GA has to 
search.l^®2] Thus, the implemented GA is closer to Lamarckian than to Darwinian evolution, as 
the individuals evolve and pass on acquired and not inherited characteristics. Afterwards, the pop¬ 
ulation is extended by the newly optimized structures and, after ranking, the two individuals with 
highest energy are removed in order to keep the population size constant. 

Termination of the algorithm is reached if one of the convergence criteria is met: (i) the low¬ 
est energy has not changed more than a defined threshold (energy_diff_conv, default=0.001 eV) 
during a defined number of iterations (iter_limit_conv, default=10), or (ii) the lowest energy has 
reached a defined value (energy_wanted), or (iii) the maximal number of iterations (max_iter, 
default=10 ) has been reached. The convergence criteria are checked only after a defined number 
of iterations (iter_limit_conv, default=10). 

We are interested not only in finding the global minimum but also in finding low-energy local 
minima as many of them may be relevant. Thus, all of the generated conformers are saved and 
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are available for final analysis even if only a subset of them eonstitutes the final population. Ta¬ 
ble 1^ summarizes praetieal GA parameters that were employed for one of the referenee systems 
(isoleueine dipeptide). 

The parameters listed in Table [^ean be taken as indieative of settings that will work for many 
small to mid-size moleeules. A few exeeptions apply. Speeifieally, the max_iter and the popsize 
parameters are set to low values in Table eovering only a small set of struetures within an 
individual GA run. This ehoiee would be appropriate for an ensemble of many short independent 
GA run to generate a broad struetural ensemble with a bias towards the low-energy solution spaee. 
For larger and more eomplex moleeules, and/or for runs designed to identify the global minimum in 
a single shot, max_iter eould be inereased signifieantly, and popsize eould be inereased somewhat 
(to 10-20 individuals) as well. Likewise, the mutation probabilities prob_for_mut_cistrans and 
prob_for_mut_rot are here set to relatively high values of 0.5, instilling a signifieantly amount of 
randomness into the seareh proeess. For a more "deterministie" seareh proeess, somewhat smaller 
values (e.g., 0.2) might be ehosen. Finally, the distance_cutoff eriteria are ehosen to be appropriate 
for light elements (first and seeond row); adjustments may be appropriate if heavier eovalently 
bonded atoms are ineluded in the seareh. 

DFT calculations 

For the tests presented below, all DFT ealeulations are performed with the FHI-aims eode.l^^tZfil 
We employed the PBE funetional^ with a eorreetion for van der Waals interaetions (pairwisel^ 
for the amino aeid dipeptides ealeulations and MBdI^ for the drug-like ligands) and with light 
eomputational settings and tier 1 basis set.l^ For the loeal optimization, we use a trust radius en- 
haneed version of the Broyden-Fleteher-Goldfarb-Shanno (BFGS) algorithmic initialized by the 
model Hessian matrix of Lindh.C This is the default ehoiee in the FHI-aims eode and was im¬ 
plemented by Jurgen Wieferink. The loeal optimization is set to terminate when the maximum 
residual foree eomponent per atom is equal to 5 ■ 10 ^ eV/ A. Density funetional, basis set, and 
numerieal settings (e.g. integration grids) are user ehoiees of the underlying density-funetional 
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Table 2: GA parameters for isoleueine dipeptide 



Parameter 

Value 


SMILES 

CC(=0)N[C @ H] (C(=0)NC) [C @ H] (CC)C 


distance_cutoff_l 

1.2 A 

Molecule 

distance_cutoff_2 

2.0 A 


rmsd_cutoff_uniq 

0.2 A 


chiral 

True 


max_iter 

10 

Run settings 

iter_limit_conv 

10 


energy_diff_conv 

0.001 eV 


popsize 

5 


energy_var 

0.001 eV 


selection 

roulette wheel 


fitness_sum_limit 

1.2 


prob_for_crossing 

0.95 

GA settings 

cross_trial 

20 


prob_for_mut_cistrans 

0.5 


prob_for_mut_rot 

0.5 


max_mutations_cistrans 

1 


max_mutations_torsions 

2 


mut_trial 

100 


theory eode and must be set appropriately outside of Fafoom. The settings for numerieal eon- 
vergence (ineluding basis set) must be chosen converged enough but not introduce artifacts in the 
landscape of minima found. The choice of the density-functional approximation (DFA) to the ex¬ 
act Born-Oppenheimer potential-energy surface needs to reproduce the local energy minima of the 
PES faithfully, as discussed in the introduction. We here only note that costs for different electronic 
structure approximation can vary by orders of magnitude. In practice, and strictly speaking, the 
scope of our algorithm is to find the PES minima for a given DET functional, while the physical 
choice of the "right" DEA is not the focus of this paper. We do show, however, that we can use 
our algorithm in practice with one specific functional, PBE functional with a correction for van der 
Waals interactions, that has yielded very reliable results in the past. 


Parallelization 

Parallel computational resources can be utilized in two ways in order to speed up the computation. 
Eirst, multiple GA runs can be started in parallel and the blacklist can be shared between different 
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and subsequent runs. Sharing the blaeklist inereases diversity of solutions with already a few GA 
runs. Seeond, the time needed for the individual energy evaluations ean be deereased if the molee- 
ular simulation paekage allows ealeulations aeross distributed nodes and is effieiently parallelized 
(e.g. in FHI-aims^. Our eode supports both modes of eomputation. 

Availability of the code 

The eode is distributed as a python paekage named Fafoom (Flexible algorithm for optimization 
of moleeules) under the GNU Lesser General Publie Lieense.l^ It is available from following 
websites: 

• https://aimselub.fhi-berlin.mpg.de/aims_tools.php 

• https://github.eom/adrianasupady/fafoom 

Although designed for usage with a first-prineiples method (e.g. FHI-aims, NWChem^^, Fafoom 
ean also be used with a foree field (MMFF94 ® aeeessed within RDKitl^^I^ for testing purposes. 
It is in prineiple possible to use any moleeular simulation paekage whieh outputs optimized ge¬ 
ometries together with their energies. Nevertheless, this requires adjusting a part of the program to 
the speeifie needs of the used software. Details are provided with the program’s doeumentation. 

Reference data 

In order to evaluate the several aspeets of the performanee of the implemented algorithm we use 
two sets of referenee data. The first referenee data set (Amino acid dipeptides) was extracted 
from a database of computational data for the amino acid dipeptides. The second reference data 
set (Mycophenolic acid) contains conformers of a drug-like ligand that were obtained with three 
different search techniques. 
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Amino acid dipeptides 


The first reference data set contains conformers of seven amino acid dipeptides^^ (Figure and 
was extracted from a large database for amino acid dipeptide structures generated in a combined 
basin-hopping/multi-tempering based search. In that search (published in detail in^^, the frame¬ 
work of the reference search can be divided into a global search step and a refinement step. In the 
global search, the basin hopping search technique together with an empirical force field OPLS-AA 
was employed to perform the initial scan of the PES. The identified energy minima were relaxed 
at the PBE-i-vdW level with light computational settings in EHI-aims. In the refinement step, ab 
initio replica-exchange molecular dynamics runs were performed to locally explore the conforma¬ 
tional space and to alleviate a potential bias of the initial search of a force field PES. The resulting 
minima were again optimized at the PBE-i-vdW level with tight computational settings and with 
the tier 2 basis set.^^In order to compare to our data, they were re-optimized with the same func¬ 
tional with light computational settings, and the tier 1 basis set.^ After this procedure, duplicates 
were removed from the set used for the comparison with the GA results. Eor benchmarking the 
performance of our search strategy for conformers predictions, we consider all structures with a 
relative energy up to 0.4 eV. These conformers define the reference energy hierarchy for each of 
the selected dipeptides. We summarize some characteristics and the number of conformers that 
were considered in Table |3l 

Table 3: Reference data set: seven amino acid dipeptides. 


Amino acid dipep¬ 
tide 

Abbr. 

No. of atoms 

No. of rotatable bonds -i- 
No. of cis/trans bonds 

No. of conformers (be¬ 
low 0.4 eV « 38.6 
kJ/mol) 

Glycine 

Gly 

19 

2+2 

15 (15) 

Alanine 

Ala 

22 

2+2 

28 (17) 

Phenylalanine 

Phe 

32 

A+2 

64 (37) 

Valine 

Val 

28 

3+2 

60 (40) 

Tryptophan 

Trp 

36 

4+2 

141 (77) 

Leucine 

Leu 

31 

4+2 

183 (103) 

Isoleucine 

He 

31 

4+2 

176 (107) 
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Figure 2: Chemical structures of the amino acid dipeptides. Rotatable bond are single, non-ring 
bonds between non-terminal atoms that are not attached to methyl groups that carry three identical 
substituents and are marked in red. Double arrows mark the cis/trans bond. 

Mycophenolic acid 

From the Astex Diverse Set,l^ a collection of X-ray crystal structures of complexes containing 
ligands from the Protein Data Bank (PDB), one example for a drug-like ligand was selected. 
This molecule, mycophenolic acid (target protein: IMEH) has 43 atoms, 8 rotatable bonds and 
1 cis/trans bond (Figure]^. 



Figure 3: The chemical structure of the selected ligand together with the PDB-ID of the respective 
X-ray structure of the target protein. Rotatable bonds are marked in red and the cis/trans bond is 
marked with double arrows. 

Mycophenolic acid is a very flexible molecule. Even a coarse systematic search with a grid 
of only 60° for the freely rotatable torsions and 2 values (cis/trans) for the double bond and the 
X-X-O-H torsions yields already 6^ *2*2*2 = 373248 conformations to test. This makes this 
molecule a challenging example to test the performance of three search techniques (A-C below) in 
combination with first-principles methods. 
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A) Genetic algorithm. 50 independent GA runs with following settings: max_iter=30, iter_limit_conv=20 
and popsize=10, were performed with Fafoom. A total of 3208 struetures were generated. 

B) Random search. 3200 random and elash-free structures were generated with Fafoom and 
further relaxed with DFT. 

C) Systematic search with Confafi^. First, 293 conformers were generated with Confab (as¬ 
sessed via Open Babel, used settings: RMSD cutoff = 0.65 and Energy cutoff = 15 kcal/mol). 

In order to account for two different values for the cis/trans bond and the X-X-O-H torsions (0° 
and 180°), eight starting structures per each of the conformers generated with Confab were con¬ 
sidered. This procedure yields overall 2344 structures. After removing geometries with clashes, 

2094 structures were subjected to DFT optimization. 

Finally, all DFT optimized structures were merged to a common pool and the duplicates were 
removed. For this, a two-step criterion was used. First, the compared structures need to have 
a torsional RMSD (tRMSD) lower than 0.1 ;r rad.^ Second, the energy difference between the 
compared structures cannot exceed 10 meV. If both criteria are met, the structure that is higher 
in energy is labelled as ’duplicate’ and is removed from the pool. In total, 1436 unique structures 
were found. Table SI shows the number of the obtained unique structures depending on the applied 
energy cutoff. 

Results and discussion 

The performance of the GA search is evaluated by the ability to reproduce the reference energy 
hierarchies and to find the global minimum. We performed multiple GA runs for the test systems 
to test the impact of varying search settings. 

Amino acid dipeptides 

For each of the amino acid dipeptides we performed 50 independent GA runs with 10 iterations 
(max_iter) each and a population size of 5 (popsize). One GA run with such settings requires 
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popsize + 2 ■ iterations = 25 geometry optimizations at the PBE+vdW level and yields 25 eon- 
formers. 

Finding the global minimum 

First we assess the probability to find the global minimum (known from the referenee energy hier- 
arehy) among them. We eheek how many of the GA runs sueeeed in finding the global minimum 
and subsequently ealeulate the probability for finding the global minimum in one GA run and 
present the results in Table 


Table 4: Average (from 50 GA runs) probability for finding the energy global minimum in a given 
run with 25 loeally optimized eonformers. 


Molecule 

Gly 

Ala 

Phe 

Val 

Trp 

Leu 

He 

TDOFs 

4 

4 

6 

5 

6 

6 

6 

Probability for global 
minimum (/I run) 

0.82 

0.79 

0.53 

0.60 

0.22 

0.20 

0.10 


Table illustrates how the magnitude of the sampling problem does not only depend on the 
dimensionality, i.e. here the number of TDOFs, but also on the ehemieal strueture. Phenylalanine 
and isoleueine are two interesting eases, both have the same number of TDOFs and are of similar 
size, but the probability of finding the global minimum with a single run drops dramatieally. The 
drop in probability is, of eourse, eorrelated with the overall number of eonformers listed in Table 

Conformational coverage 

A key point in our approaeh is to reproduee the known energy hierarehies of the referenee systems. 
For eaeh of the investigated eompounds, we randomly ehoose 5, 10, 15, 20, and 25 runs (from the 
pool of 50 runs), merge the results, and eheek how many struetures have been found. We repeat 
this proeedure 10,000 times and present the results in Figure]^ 

It is evident that for dipeptides with a small number of referenee minima (alanine and glyeine) 
we obtain a very good results, i.e. a very good eoverage of eonformational spaee, already with 5 
repeats of the GA runs. For dipeptides with a slightly higher number of minima (phenylalanine and 
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Figure 4: Number of minima found by the GA for seven amino acid dipeptides. The horizontal 
lines depict the total number of minima for the given compound as predicted by Ropo et al^ 
From a total of 50 GA runs, 5, 10, 15, 20, 25 GA runs were randomly selected and the found 
structures were counted. This procedure was repeated 10,000 times and the resulting distributions 
are summarized in box plots. The line inside the box is the median, the bottom and the top of the 
box are given by the lower (Qo.zs) and upper (2o.75) quartile. The length of the whisker is given by 
1-5 ■ (2o.75 — 2o.25)- Outliers (any data not included between the whiskers) are plotted as a cross. 

valine) at least 10 runs of the GA are needed to obtain a good result. For the remaining dipeptides, 
the GA is not able to find all of the reference minima, even with 25 GA runs. However, the 
coverage of the reference hierarchy with 20 GA runs is always higher than 80%. We next inspect 
in more detail which of the amino acid dipeptides’ reference minima were missed. To this end 
we investigate the actual difference between the reference hierarchy and the hierarchy obtained 
from the 50 GA runs, see Figure Although our search strategy misses a few of the reference 
structures even when 50 repeats of the GA search are performed, the first missed structure has 
a relative energy higher than 0.2 eV. This in turn means that no low-energy structures are being 
missed. Furthermore, there are multiple newly predicted structures that were not present in the 
reference data set (Figure [^. It should be noted that, considering the fact that the investigated 
GA runs are rather short, the random component of the search (randomly initialized populations) 
contributes to the good results of the search. 
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missed by the GA 
new found by the GA 



Figure 5: Difference hierarchies for the amino acid dipeptides. Red lines depict structures from 
the reference data set that have not been found by the GA. Green lines depict structures found by 
the GA that were absent in the reference data set. Gray lines depict structures from the reference 
data set that were found by the GA. The results from all 50 GA runs for each dipeptide were taken 
into account. 


Parameter sensitivity 

In order to check the robustness of the default run parameters, several alternative settings were 
tested for the isoleucine dipeptide. The tested parameters include: (i) the impact of the selection 
mechanism (roulette wheel, reverse roulette wheel, random), (ii) the effect of decreasing the cut¬ 
off for blacklisting from the default value of 0.2 A to 0.05 A, and (iii) the increase of the maximal 
number of iterations from the default 10 to 15, 20 and 25. For cases (i) and (ii) 100 GA runs were 
performed for each of the settings. In order to assess the effect of the number of iterations, 100 
runs with a maximal number of iterations equal to 25 have been performed and subsequently only 
considered up to a maximum of 15, 20, 25 maximum iterations. Additionally, 50 GA runs with a 
maximal number of iterations equal to 100 were performed. In all mentioned cases convergence 
criteria were evaluated after each iteration, starting from the iter_limit_conv=10th iteration. 

We find that none of the three selection mechanisms has a distinct impact on the probability 
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for finding the global minimum or quality of the eonformational eoverage. Similarly, no substan¬ 
tial change was observed upon the decrease of the blacklisting cut-off. The probability value for 
finding the global minimum as well as the number of found reference minima increases with in¬ 
creased number of iterations. This is simply due to the increased number of trials for sampling the 
conformational space. Table summarizes the probability to find the global minimum in one run 
with different settings. Detailed data about the conformational coverage is given in Figure S2. 

Table 5: Probability of finding the global minimum of isoleucine in one run for different setups. 
The default settings include roulette wheel selection mechanism, 0.2 A cut-off for the blacklisting 
and maximal number of iteration equal to 10. The numbers in brackets denote the mean number 
of iterations needed for convergence. 


Setup 

Probability of finding the 



global minimum (per run) 

default 

0.17 


roulette wheel reverse 

0.18 

Selection mechanism 




random 

0.13 


15 (13) 

0.20 


20(15) 

0.25 

Max. number of iterations 




25 (16) 

0.25 


100 (22) 

0.46 

Cut-off for blacklisting 

0.05 A 

0.14 


Evaluation of the computational performance 

The accuracy of a search/sampling strategy is its most crucial feature. Nevertheless, its compu¬ 
tational cost plays a significant role in practical applications. To this end, we quantify the total 
cost of the GA runs in terms of force evaluations required in the local geometry optimizations. 
The number of force evaluations, i.e. most expensive steps in the algorithm, is a suitable measure 
for the computational cost. One force evaluation requires approximately between 1 (glycine) to 3 
(tryptophan) CPUminutes. We quantify the number of force evaluations required by the GA for 
reproducing 85% of the reference hierarchy and present the results in Table The table also in¬ 
cludes the number of force evaluations required only in the replica-exchange MD refinement step 
of the reference search (the number of force evaluations required for the geometry optimizations 
is not even included). 


21 










Table 6: Comparison of the eomputational eost: amino aeid dipeptides. The eost is given in the 
total number of foree evaluations [*10^]. 




Total number of force evaluations[=t 

10^] 



Gly 

Ala 

Phe 

Val 

Trp 

Leu 

He 

GA (at least 85% reproduction of 

11 

12 

29 

24 

60 

68 

61 

the reference hierarchy) 








Reference 

380 

400 

480 

440 

500 

460 

460 


Mycophenolic acid 

In the following we utilize as reference a set of structures that is a result of merging all structures 
found by three techniques: 3208 structures from 50 GA runs, 3200 random structures and 2094 
structures generated with Confab. We define the following subsets: (i) ’GA’ is a random selection 
of 25 GA runs (approx. 1600 structures); (ii) ’SYS (CONFAB)’ is a set of all 2094 structures 
generated in the systematic search; and (iii) ’RANDOM’ is random selection of 1600 structures 
generated in the random search. For the performance evaluation we count how many of the ref¬ 
erence structures can be found by the respective search technique. This procedure was repeated 
1000 times for each of the energy cutoffs. The results are shown in FigureMore details can be 
found in Table S1. 

All of the search techniques found the same global minimum several times. In case if no energy 
cutoff is applied, none of the searches is able to find all local minima in the conformational space 
(i.e. more calculation would be needed). With a decreasing energy cutoff, an improved coverage of 
the conformational space can be observed. The fact that the GA is a global optimization techniques 
is clearly visible as it performs better in the low-energy (< 0.2 eV) region, whereas the random 
and systematic search perform uniformly but not perfectly independent of the energy cutoff used 
for the evaluation. 

In order to show the wide and routine applicability of our first-principles structure search ap¬ 
proach, we have performed short exploratory structure searches (only 3 GA runs each) to eight 
drug-like ligands from the Astex Diverse Set, which is widely utilized to assess the performance 
of, for example, conformer generators. The molecules vary in the size (8 to 32 atoms) and number 
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Energy cutoff [eV (kJ/mol)] 

Figure 6: Share of the reference number of structures found by three search techniques: GA (blue 
circles), random search (red squares) and systematic search with Confab (black triangles) as a 
function of the applied energy cutoff. Energy values are given in eV and in parentheses also in 
kJ/mol. 
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of rotatable bonds (6 to 13). A detailed analysis of this study is shown in the Supporting Informa¬ 
tion. In brief we find that in all eight instanees a diverse pool of eonformers ean be generated. In 
eaeh ease, a eonformer is found that is similar to the protein-bound ligand from the X-ray strue- 
ture with an RMSD of 1.5 A. In six of the eight instanees, they are similar with RMSD values of 
less than 0.9 A. Exploratory first-prineiples strueture searehes have a potential applieation in in 
silico protein-ligand studies:^ the eomparison of the struetural spaee of the isolated ligand and the 
strueture realized by the protein-bound ligand might reveal details about the binding proeess, for 
example whether the binding meehanism follows more the eonformational-seleetion or indueed-fit 
type. In eontrast to many of the quieker (but simpler) established eonformer generators, the first 
prineiples energeties that we obtain here are not dependent on initial parametrizations and thus the 
method is in prineiple applieable throughout ehemieal spaee. It is important to note that, in this 
test, our goal was not to provide a eonverged GA seareh for eaeh moleeule but rather to explore 
the GA’s potential to provide approximate eonformational eoverage with a fixed eomputational 
budget. Our investigation of myeophenolie aeid indieates that searehes for eaeh of these moleeule 
eould be reliably eonverged albeit at signifieantly higher eomputational expense. 

Literature context 

In order to put the algorithm’s parameters into perspeetive, we eompare it to four seleeted appliea- 
tions of EA or GA to the eonformational seareh of moleeules in the following. In all eonsidered 
algorithms, the initial populations are generated randomly and the eonformational spaee of the 
respeetive moleeules is represented and sampled (by mutation and erossing-over) by means of tor¬ 
sion angles, i.e. rotations around bonds. Table |7] summarizes a few parameters that illustrate the 
range over whieh the parameters that are eharaeteristie to these kinds of evolutionary or genetie 
algorithms ean vary. The approaehes differ in the energy funetions that are employed: Damsbo 
et al.^ employ the CHARMM foree field, Vainio and Johnson^^ use the torsional and the vdW 
term of the MMEE94 foree field separately in a multi-objeetive genetie algorithm (MO-GA) while 
Nair and Goodman^ use the MM2* foree field. The study on optimizing the GA parameters 
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for molecular search with a meta-GA, presented by Brain and Addicoat,!^ uses, similar to our 
work, a first-principles energy functions. Two choices in the algorithm highlight the difference 
between theirs and our aim: in order ”to reliably find the already known a priori correct answer 
with minimum computational resources", the selection criterion ’rank’ focuses on the generation’s 
best solution. Furthermore, crossing-over is considered as not helpful. In contrast, the aim of 
our work is to provide a GA implementation that ensures broad conformational coverage, i.e. the 
prediction of an energy hierarchy and not only the reproduction of a global optimum. For that we 
found it useful to employ random or roulette-wheel selection that also accepts less-optimal struc¬ 
tures for genetic operations and a high probability for crossing-over. Both choices (accompanied 
by blacklisting) can be interpreted as means to increase diversity during the search. 


Table 7: Comparison of parameters and schemes that are used in search approaches proposed in 
four selected publications with the approach presented in this work. 


Parameter 

Damsbo’04l^ 

Vainio’07«0 

Nair’OSSS 

Brain’ 11'2^ 

this work 

Algorithm type 

EA 

MO-GA 

GA 

GA 

GA 

Population size 

30 

20 

2-20 

10-15 

5 

Selection 

- 

tournament 

roulette 

rank 

roulette 

Crossing-over probability 

- 

0.9 

1.0 

O.O-l.O 

0.95 

Mutation probability 

- 

0.05 

0.4 

0.3-0.5 

0.5 


Conclusions 

We aimed at designing a user-friendly framework with an implementation of the genetic algorithm 
for searches in molecular conformational space that is particularly suitable for flexible organic 
compounds. A SMILES code for the selected molecule is the only required input for the algorithm. 
Furthermore, a wide selection of parameters (e.g. torsion definition, blacklist cut-off) allows for 
customizing the search. With minor changes, the code can be interfaced to external packages 
for molecular simulations that output optimized geometries together with corresponding energies. 
Besides its adaptability and ease of use, a further advantage of the implementation is the fact 
that it allows for using first-principles methods. With this, a potential bias resulting from the 
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parametrization of a particular force-field can be avoided and makes the seareh applicable to a 
broad selection of problems. We examined the performanee of the implementation in terms of 
effieiency and accuraey of the sampling. The algorithm is eapable of reprodueing the reference data 
with a high aeeuraey. For a set of amino acid dipeptides, we show that this eonformational eoverage 
is aehieved mueh more effieiently than in an earlier, ab initio repliea-exehange MD based seareh in 
our group. For a larger molecule (myeophenolic aeid), we show that the low-energy eonformational 
space coverage of the GA surpasses the coverage of two eompeting methods signifieantly at similar 
effort. 
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Crossover procedure 


The following seheme (Figure]^ explains the erossover proeedure described in the Methods see- 
tion of the artiele. 


Parent 1 

Cis/trans Torsions: 
bonds: 

ct1 ct2 t1 ^ 13 t4 


Parent 2 


Ict1*ct2* t1*|,t2* 13*14* 


Child 1 


Crossing-over 


Child 2 


at cl2 11 12*13*14*1 Icl1*cl2*t1*l2 13 141 


Figure 7: Crossing-over proeedure. The lists of values for cis/trans bonds and torsions are first 
combined together. As next, a single eut is performed and the eorresponding parts are exehanged. 
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Conformational coverage: Isoleucine 

Figure shows the coverage of the conformational space of isoleucine dipeptide resulting from 
GA searches with different settings. 


Conformational coverage: Mycophenolic acid 

Table contains the total number of mycophenolic acid conformers found by three conformer 
generation techniques depending on the energy cutoff. Further, it contains detailed data used to 
create Figure 6 in the article. As there are three search techniques there are eight possible cases 
for finding a structure or not: (i) a structures can be found by all searches (’ah’) or (ii) a structure 
can be found by two of three searches {'GA+RANDOM'GA+SYS', 'SYS+RANDOM') or (iii) 
a structure can be found by only one search {’only GA’, ’only RANDOM’, ’only SYS’) or (iv) a 
structure can be missed by all of the searches (not included in the table). 

Table 8: Share of found structures by the GA, systematic search and random conformer generation 
depending on the used energy cutoff. 


Energy 
cutoff (eV) 

/ (kJ/mol) 

Nr. of 

struc¬ 

tures 

Unique 

struc¬ 

tures 

only 

GA 

only 

RAN¬ 

DOM 

[%] of 

only 

SYS 

structures i 

GA -r 

RAN¬ 

DOM 

ound by 

GA -r 

SYS 

RAN¬ 
DOM -r 

SYS 

all 

no 

8502 

1436 

6.28 

14.00 

19.23 

6.45 

5.20 

17.00 

19.65 

0.5 (48.2) 

7164 

1006 

7.18 

12.04 

15.26 

8.13 

5.90 

16.05 

24.82 

0.4 (38.6) 

6288 

764 

8.02 

10.74 

12.95 

9.30 

6.17 

13.89 

28.90 

0.35 (33.8) 

5452 

625 

8.43 

10.30 

12.69 

10.22 

6.39 

12.29 

30.07 

0.3 (28.9) 

4961 

531 

9.10 

9.97 

11.53 

11.36 

6.62 

10.92 

31.00 

0.25 (24.1) 

4535 

458 

9.53 

8.39 

11.23 

12.38 

7.16 

10.16 

32.37 

0.2 (19.3) 

4079 

345 

8.53 

6.79 

9.52 

14.27 

7.78 

8.79 

37.96 

0.15 (14.5) 

3153 

225 

9.51 

5.64 

8.17 

15.71 

7.49 

6.37 

41.53 

0.1 (9.6) 

1314 

74 

7.04 

4.33 

6.56 

14.07 

7.36 

4.67 

51.68 

0.05 (4.8) 

298 

19 

6.87 

2.81 

1.04 

15.10 

19.05 

2.28 

51.31 
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Figure 8: Conformational coverage of the GA search with different settings for He. The horizontal 
lines depict the number of reference minimal^ (107). From a total of 100 GA runs, 5, 10, 15, 20, 
25, 30, 35 GA runs were randomly selected and the found structures were counted. This procedure 
was repeated 10,000 times and the resulting distributions are summarized in box plots. The line 
inside the box is the median, the bottom and the top of the box are given by the lower (Qo.zs) 
and upper (Qojs) quartile. The length of the whisker is given by 1.5 ■ (2o.75 ~ Qo.is)- Any data 
not included between the whiskers is plotted as an outlier with a cross. Conformational coverage 
hardly changes by using different selection mechanisms (A) or changing the blacklisting cut-off 
(B). (C) Increasing the number of GA iterations improves conformational coverage. 
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A small set of flexible organic molecules 

We utilize the Astex Diverse Set,l^ a collection of structures obtained from X-ray crystal structures 
from the Protein Data Bank (PDB), to construct the a small set of flexible organic molecules. The 
goal here is not to provide a converged GA search for each molecule but rather to explore the GA’s 
potential to provide approximate conformational coverage with a fixed computational budget. Our 
investigation of mycophenolic acid indicates that searches for each of these molecule could be 
reliably converged albeit at significantly higher computational expense. 

We selected 8 ligands (Figure]^ that differ in composition, the number of heavy atoms (15-32) 
and the number of rotatable bonds (6-13).^ 



Figure 9: Chemical structures of the selected ligands together with the PDB-IDs of the respective 
X-ray structures of the target proteins. Rotatable bonds are marked in red. 


Genetic algorithm searches 

SMILES codes for the respective entries were taken from PubChem to ensure an unbiased start¬ 
ing point for the search. We utilize the parameter values as listed in Table 2 in the article with 
following exceptions: max_iter=30, iter_limit_conv=20, popsize=10, prob_for_mut_rot=0.8, 
prob_for_mut_cistrans=l, cross_trial=100 and max_mutations_torsion=3. For each of the 
molecules, three GA runs have been performed, i.e. given the settings, the number of obtained 
conformers cannot be higher than 210. 
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Results 


We present the summary of the results in Table The number of found eonformers is obtained 
after removing duplieates among all obtained eonformers. For eaeh of the moleeules, we ealeulate 
the RMSD between the non-hydrogen atoms of eaeh of the obtained struetures and the referenee 
ligand (Figure [T0}4) (hydrogens in the Astex Diverse Set set are the result of modeling and not 
part of the experimental result). With this, we identify the best match, i.e. the eonformer whieh is 
most similar to the referenee ligand. Furthermore, the reference ligand structures were optimized 
with DFT and are added to the respective plots for completeness. Figure [TO^ shows the overlay 
between the reference ligand (before the DFT optimization) and the best match for all molecules. 

Table 9: Selected ligands from the Astex Diverse Set. The number of found eonformers is obtained 
after removing duplicates among all obtained eonformers. The best match is the eonformer which 
is most similar to the reference ligand. 


Target 

protein 

No. of 

heavy 

atoms 

No. of 

rotatabfe 

bonds 

No. of 

found con- 
formers 

RMSD (A) 
the ligand at 

best match 

between 

id the 

GA mini¬ 
mum 

AE (eV) bel 
GA minimui 

best match 

ween the 

tn and the 

optimized 

ligand 

1G9V 

25 

8 

70 

1.43 

1.66 

0.536 

0.035 

IHVY 

32 

fO 

176 

1.2 

2.73 

0.707 

0.505 

IJLA 

27 

7 

41 

0.56 

1.44 

0.339 

0.268 

1R55 

23 

13 

116 

0.88 

1.59 

0.315 

0.326 

1SQ5 

i5 

10 

152 

0.77 

2.14 

0.661 

0.555 

lUNL 

26 

9 

166 

0.65 

2.23 

0.076 

0.026 

1V48 

22 

6 

118 

0.72 

2.23 

0.696 

0.722 

2BRi 

29 

8 

73 

0.28 

0.63 

0.005 

0.002 


For all investigated systems, we obtain a large number of eonformers that are spread over a 
wide energy window. This satisfies our primary goal of obtaining a diverse set of eonformers with 
a reliable energy hierarchy in a straightforward fashion. Moreover, in most of the cases, the RMSD 
between the best match and reference ligand is satisfactory (i.e. below 1.0 A). Here we would like 
to note, that our energy evaluations are performed in the gas phase while the reference ligand is 
obtained from a X-ray crystal structure. The energy of the best match is significantly higher than 
the energy of the GA minimum in most of the cases. This finding supports the need for providing a 
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B 


RMSD to the reference ligand (A) 






Figure 10: Evaluation of the results for the subset of the Astex Diverse Set. (A) Relative energy of 
all found conformers as a function of the RMSD to the reference ligand (blue circles). The green 
squares depict the reference ligand structures after DFT optimization. (B) An overlay between the 
reference ligand (green) and the best match (blue) is presented together with the corresponding 
RMSD value. 
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broad range of conformers instead of only foeusing on the global minimum of the partieular energy 
funetion. 

A few eases require further analysis. For two ligands, with the targets 1G9V and IHVY, the 
RMSD values between the best match and the referenee ligand strueture exeeed the threshold 
value (1.0 A). One possible reason is the faet, that the referenee ligand is not a minimum on the 
PES sampled by the GA. Another trivial eause might be the insuffieient number of the performed 
GA runs. 

Further we note that the optimization of the orientation of the hydroxy groups is required for 
obtaining a meaningful eonformational ensemble. 

Apart from performing short GA runs, one long GA run has been performed for eaeh of the 
seleeted moleeules for eomparison. In order to obtain eomparable results (by means of the num¬ 
ber of performed DFT optimization), following parameters have been adjusted: max_iter=80, 
iter_limit_conv=70. We eompare the results in terms of: (i) energy of the most stable strueture, 
(ii) matehing the referenee ligand and (iii) number of found eonformers. Detailed data about the 
results of the single long runs are given in Table [T0| In terms of finding the best match, the long GA 
run performs better than the three short GA runs together for some of the moleeules. On the other 
hand, the number of found struetures is signifieantly higher if three short GA runs are performed 
instead of a single long run for most of the moleeules. 

The results of the exploratory strueture searehes for the 8 drug-like ligands suggest that per¬ 
forming one long GA run instead of 3 short GA runs may inerease the ehanee for finding the global 
minimum and simultaneously deerease the number of identified unique eonformers. 
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Table 10: Comparison of the results obtained from one long GA run (max. 80 iterations) and three 
short GA runs (eaeh max. 30 iterations). AE is the differenee between the most stable struetures 
found in the eompared setups. 


Molecule 

Number of found conform- 

RMSD (A) between the lig¬ 

AE (eV) 


ers 


and and the best match 



1 X max. 80 

3 X max. 30 

1 X max. 80 

3 X max. 30 



iterations 

iterations 

iterations 

iterations 


1G9V 

45 

70 

0.57 

1.43 

0.0 

IHVY 

146 

176 

1.07 

1.2 

0.211 

IJLA 

40 

41 

0.62 

0.56 

-0.005 

1R55 

85 

116 

0.69 

0.88 

0.081 

1SQ5 

99 

152 

0.64 

0.77 

0.031 

lUNL 

93 

166 

0.8 

0.65 

0.003 

1V48 

115 

118 

0.81 

0.72 

0.054 

2BR1 

47 

73 

0.28 

0.28 

-0.005 
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